Dataset Construction for Gene Structure Prediction and Alternative Splicing Analysis

Authors

  • Masahiko Mizuno
  • Hideki Nagasaki
  • Makiko Suwa
Abstract

The performance of gene finding from genome sequences depends strongly on the accuracy of splice site prediction, yet recent gene-finding programs still do not reach sufficient accuracy. Improving splice site prediction requires understanding the splicing mechanism and building a model from clear experimental evidence. For this purpose, genomic full-length precursor mRNA sequences (FL-pre-mRNAs), together with expression information, are indispensable. FL-pre-mRNAs cover the entire gene structure, including the 5' and 3' ends of the mRNA, the initiation codon, splice sites, the stop codon, and polyadenylation signals. They also contain all alternative splice sites except those in the first or last exons of alternative transcripts. However, no database of FL-pre-mRNAs has been reported in previous work. Aligning expressed sequence tags (ESTs) to genomic sequences has been a common method for gene prediction and splice site analysis (1, 3). ESTs are nevertheless unsuitable for collecting FL-pre-mRNAs: they are partial sequences, the 5' ends of the mRNAs are unknown in most cases, and even EST contigs clustered in UniGene (2) or the RefSeq database (4) cannot be assumed to be full-length. This is because ESTs are single sequencing reads that contain mutations, insertions, or deletions (5). As genomic and EST sequence data have grown, computational approaches have become one of the methods for annotating sequences as putative genes or ORFs. In contrast, the GenBank database has accumulated entries in which complete genomic protein-coding sequences or full-length mRNA sequences are characterized by experimental evidence. Because this information is more reliable than in silico prediction, the sequences and their annotation (the positions of gene boundaries and functional signals) are expected to be of high quality. Thus, we constructed datasets with experimental annotation from the GenBank database for gene structure prediction and splice site analysis.
Moreover, the analysis of constitutive and alternative splice sites, and their correlation with several biological descriptors, will be discussed.
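
As a rough illustration of how annotated gene boundaries translate into splice sites, the sketch below derives intron coordinates and the donor/acceptor dinucleotides from a GenBank-style `join(...)` feature location. It is not the authors' pipeline: the function names, the toy sequence, and the restriction to forward-strand locations without partial (`<`, `>`) markers or single-base exons are all assumptions made for this example.

```python
import re

def parse_join(location):
    """Return (start, end) exon intervals, 1-based inclusive, from a
    forward-strand location string such as 'join(1..6,11..16)'.
    Assumption: no complement(), no partial markers, no single-base exons."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]

def introns_from_exons(exons):
    """Introns are the gaps between consecutive exons (1-based inclusive)."""
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]

def splice_dinucleotides(sequence, introns):
    """Return (donor, acceptor) dinucleotides for each intron:
    the first two and the last two bases of the intron."""
    sites = []
    for start, end in introns:
        donor = sequence[start - 1:start + 1]   # first two intron bases
        acceptor = sequence[end - 2:end]        # last two intron bases
        sites.append((donor, acceptor))
    return sites

# Toy example (hypothetical sequence): two exons separated by a 4-base intron.
toy_sequence = "ATGAAAGTAGTTTTAA"
toy_exons = parse_join("join(1..6,11..16)")
toy_introns = introns_from_exons(toy_exons)
print(toy_introns, splice_dinucleotides(toy_sequence, toy_introns))
```

A real dataset would also need to handle reverse-strand features and to filter entries by whatever experimental-evidence annotation the records carry; both are left out here.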

Similar resources

Improving the Prediction Accuracy of Gene Structures in Eukaryotic DNA with Low C+G Contents

We have developed a gene prediction program, GeneKey. When trained with the widely used dataset collected by Kulp and Reese, GeneKey can achieve high prediction accuracy for genes with moderate and high C+G contents. However, the prediction accuracy is much lower for CG-poor genes. To tackle this problem, we construct a new LCG316 dataset composed of gene sequences with low C+G contents. For CG-poor genes, th...

Identification of alternative 5′/3′ splice sites based on the mechanism of splice site competition

Alternative splicing plays an important role in regulating gene expression. Currently, most efficient methods use expressed sequence tags or microarray analysis for large-scale detection of alternative splicing. However, it is difficult to detect all alternative splice events with them because of their inherent limitations. Previous computational methods for alternative splicing prediction coul...

SpliceInfo: an information repository for mRNA alternative splicing in human genome

We have developed an information repository named SpliceInfo to collect the occurrences of the four major alternative-splicing (AS) modes in human genome; these include exon skipping, 5'-alternative splicing, 3'-alternative splicing and intron retention. The dataset is derived by comparing the nucleotide and protein sequences available for a given gene for evidence of AS. Additional features su...

Publication date: 2001